VideoCapsuleNet: A Simplified Network for Action Detection
Kevin Duarte, Yogesh Rawat, Mubarak Shah
We propose a 3D capsule network for videos, called VideoCapsuleNet: a unified network for action detection which can jointly perform pixel-wise action segmentation along with action classification. The proposed network is a generalization of capsule networks from 2D to 3D and takes a sequence of video frames as input. The 3D generalization drastically increases the number of capsules in the network, making capsule routing computationally expensive.
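To make the routing cost concrete, here is a minimal sketch of the standard routing-by-agreement procedure from 2D capsule networks (Sabour et al.); it is illustrative only and not VideoCapsuleNet's own routing variant. Each routing iteration touches every (input capsule, output capsule) pair, so moving from 2D spatial grids to 3D spatio-temporal grids multiplies `num_in` by the temporal extent and inflates the cost accordingly.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule non-linearity: shrinks short vectors toward 0, long ones toward unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between one capsule layer and the next.

    u_hat: (num_in, num_out, dim) prediction vectors from lower capsules.
    Returns (num_out, dim) output capsule vectors.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))  # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over output capsules
        s = (c[:, :, None] * u_hat).sum(axis=0)               # weighted sum of predictions
        v = squash(s)                                         # candidate output capsules
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)          # reward agreeing predictions
    return v
```

Each iteration is O(num_in x num_out x dim); in a 3D network, num_in grows with width x height x frames, which is why reducing the routing cost matters.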
MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection
Lu, Hui, Yu, Yi, Lu, Shijian, Rajan, Deepu, Ng, Boon Poh, Kot, Alex C., Jiang, Xudong
Abstract--Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent structured state-space models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe while handling long-span action instances. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other for superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter (SSTA) which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.

Temporal action detection (TAD) aims to detect specific action categories and extract corresponding temporal spans in untrimmed videos. It is a long-standing and challenging problem in video understanding with extensive real-world applications such as sports analysis, surveillance, and security. The development of deep neural networks such as CNNs [1], [2] and Transformers [3], [4] has led to continuous advancements in TAD performance over the past few years.
However, CNNs have limited capabilities in capturing long-range dependencies, while Transformers face challenges with computational complexity and feature discrimination [1].

Hui Lu and Yi Yu are with the Rapid-Rich Object Search Lab, Interdisciplinary Graduate Programme, Nanyang Technological University, Singapore (e-mail: {hui007, yuyi0010}@e.ntu.edu.sg).
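The two properties the abstract appeals to, linear complexity and decay of temporal context under recursive processing, both fall out of the basic state-space recurrence. The sketch below is a generic diagonal linear SSM scan (illustrative names and shapes; it is not MambaTAD's DMBSS module or the selective scan used in Mamba): one pass over the sequence gives O(T) cost, but context from k steps back is scaled by A**k, so distant context fades.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear state-space recurrence h_t = A*h_{t-1} + B x_t, y_t = C h_t.

    x: (T, d_in) input sequence; A: (d_state,) diagonal decay terms;
    B: (d_state, d_in); C: (d_out, d_state). Returns y: (T, d_out).
    One pass over T steps -> O(T) time, unlike O(T^2) self-attention.
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])
    ys = np.empty((T, C.shape[0]))
    for t in range(T):
        h = A * h + B @ x[t]  # recursive update: existing context is scaled by A each step
        ys[t] = C @ h         # with |A| < 1, context from step t-k decays like A**k
    return ys
```

The geometric A**k decay is exactly the "decay of temporal context" the abstract identifies as problematic for long-span action instances.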
NIPS 2013, Neural Information Processing Systems, December 5-10, Lake Tahoe, Nevada, USA
Paper ID: 1139
Title: Action is in the Eye of the Beholder: Eye-gaze Driven Model for Spatio-Temporal Action Localization

Reviews

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance.

This paper proposes a method for action detection (localization and classification of actions) using weakly supervised information (action labels + eye gaze information, no explicit definition of bounding boxes). Overall, the spatio-temporal search (a huge spatio-temporal space) is done using dynamic programming and a max-path algorithm. Gaze information is introduced into the framework through a loss which accounts for gaze density at a given location.

QUALITY: The paper seems technically sound and makes for a nice study given gaze information.
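The max-path search the review mentions can be illustrated with a simplified dynamic program. The sketch below drops the spatial-smoothness constraint that a real max-path formulation enforces between boxes in consecutive frames (so the per-frame choice decouples and a Kadane-style scan suffices); all names and the score layout are hypothetical.

```python
def max_path(scores):
    """Best temporally-contiguous path through per-frame candidate scores.

    scores: list of lists; scores[t][k] is the signed score of candidate
    region k in frame t. Returns (best_total, start_frame, end_frame, picks),
    where picks[i] is the chosen candidate for frame start_frame + i.
    Kadane-style DP: a path either starts fresh at frame t or extends one
    ending at frame t-1.
    """
    best_total = float("-inf")
    best = None
    run_score, run_start, run_picks = 0.0, 0, []
    for t, frame in enumerate(scores):
        k = max(range(len(frame)), key=lambda i: frame[i])  # best region in this frame
        if run_score <= 0:  # extending a non-positive prefix never helps
            run_score, run_start, run_picks = 0.0, t, []
        run_score += frame[k]
        run_picks.append(k)
        if run_score > best_total:
            best_total = run_score
            best = (run_start, t, list(run_picks))
    start, end, picks = best
    return best_total, start, end, picks
```

Frames outside the action get negative scores, so the maximum-scoring run localizes the action both temporally (start/end frames) and spatially (the chosen region per frame).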